Abstract. Maintenance is a dominant component of software cost, and 
localizing reported defects is a significant component of maintenance. We 
propose a scalable approach that leverages the natural language present 
in both defect reports and source code to identify files that are potentially 
related to the defect in question. Our technique is language-independent 
and does not require test cases. The approach represents reports and 
code as separate structured documents and ranks source files based on a 
document similarity metric that leverages inter-document relationships. 

We evaluate the fault-localization accuracy of our method against both 
lightweight baseline techniques and also reported results from state-of- 
the-art tools. In an empirical evaluation of 5345 historical defects from 
programs totaling 6.5 million lines of code, our approach reduced the 
number of files inspected per defect by over 91%. Additionally, we quali- 
tatively and quantitatively examine the utility of the textual and surface 
features used by our approach. 



1 Introduction 

Maintenance tasks can account for up to 90% of the overall cost of software 
projects [9, 17]. A significant portion of that cost is incurred while dealing with 
software defects [35]. Large software projects typically use defect reporting sys- 
tems that allow users to submit reports directly; this has been shown to improve 
overall software quality [2, 36]. User-submitted defect reports vary widely in util- 
ity [21]; reports go through triage to allow developers to focus on those reports 
that are most likely to lead to a resolution. We propose a system that makes the
maintenance process more efficient by leveraging this user-provided information
to reduce the cost of localizing faults.

Fault localization is the process of mapping a fault (i.e., observed erroneous 
behavior) back to the code that may have caused it. Performing fault localization 
is relatively time consuming [41] and thus costly. For this reason, many existing 
techniques attempt to facilitate this process. In general, such techniques rely 
on test cases [1,13,23,37,42], model checking [6,7], or remote monitoring [31, 
32]. These approaches may not be directly applicable to user-submitted defect 
reports, since reports rarely provide a full test case or program trace [21]. 

In this paper, we address the cost of such localization for user-submitted 
defect reports. We present a lightweight approach that maps defect reports to 
source code locations. Our approach relies primarily on textual features of both
source code and defect report descriptions, although it takes advantage of certain
additional information (e.g., stack traces, version control histories) when they are 
available. Notably, it does not require test cases, compilation, execution traces, 
or remote sampling, all of which can potentially limit the applicability of other 
fault localization strategies. 1 

Our approach is based on several underlying assumptions about the textual 
features of both source code and defect reports. With respect to code, we assume 
that developers choose identifier names and comment text that are representa- 
tive of observable program behavior. For defect reports, we assume that the 
reporters use a vocabulary based on their observations of program behavior — 
a vocabulary that will thus be in some ways similar to developers', although re- 
porters may not have access to the source. Finally, we hypothesize that a defect 
report and a code location are more likely to pertain to the same fault if they 
are similar in terms of word usage; we formalize this in a similarity metric. 

The main contributions of this research project are thus: 

— A lightweight, language-independent model that statically measures similar- 
ity between defect reports and source files for the purpose of locating faults. 
This comparison is based on a structured textual analysis of the natural 
language in both documents. 

— A large empirical evaluation of our technique including 5345 real-world de- 
fects from three large programs totaling 6.5 million lines of code — over an 
order of magnitude larger than the evaluations in previous work [13, 23, 37]. 
Our approach reduces the number of files developers must inspect per defect
by 91.5%, outperforming baselines such as using user-reported stack traces 
(53.1%) and previously-published results. 

— A quantitative and qualitative explanation of our technique's success. No- 
tably, we show that factors such as the reported priority or the number of 
duplicate reports present are not related to our model's success. Instead, 
we find that human word choice determines the performance of our model, 
thus supporting our hypothesis that the vocabulary chosen by developers 
and reporters can be used to localize faults. 

The structure of this document is as follows. In Section 2, we motivate our 
approach by presenting an example fault with its associated defect report and 
source code. Section 3 outlines our approach and formally defines how we mea- 
sure the relative similarity between code and defect report text. Next, Section 4 
presents a detailed empirical evaluation of our approach. Section 5 places our 
work in context. Finally, Section 6 concludes. 

2 Motivating Example 

In this section, we present an example defect report taken from the Eclipse
project. This example illustrates the potential benefit of matching the natural
language in a defect description with keywords from the source code for the
purpose of identifying the defect's location.

1 Strictly speaking, this paper presents a defect localization approach, but fault
localization is the term of art used for this line of research (e.g., [1, 23, 37, 42]).

User-submitted defect reports typically consist of a free-form textual de- 
scription of the fault. When presented with such a defect report, it is up to 
the developer to derive and locate the cause of the undesirable behavior. This 
requires thorough familiarity with the code base; for large projects, an impor- 
tant part of the triage process is finding which developer is most likely to be 
able to resolve a given defect report [3]. There are significant differences among
developers in terms of how quickly they can locate a given fault [41]. Our goal 
is to narrow the source code search space that the developer needs to consider, 
thereby decreasing the software maintenance cost overall. 

Consider the following defect report from the Eclipse project, defect #91543,
entitled "Exception when placing a breakpoint (double click on ruler)." The 
description is as follows: 

With M6 and also with build I20050414-1107
i get the stacktrace below now and then when
wanting the place a breakpoint when double
clicking in the editor bar. if i close the
editor and reopen it again it goes ok.

!MESSAGE Error within Debug UI:
!STACK
org.eclipse.jface.text.BadLocationException
at org.eclipse.jface.text.AbstractLineTracker.getLineInformation(AbstractLineTracker.java:251)

Initially, a developer might be inclined to inspect code implicated directly. In 
this case, one might check the AbstractLineTracker file and other files in the stack 
trace, or search the list of all files that reference a BadLocationException. Addi- 
tionally, one might scan the files that were changed prior to either of the particu- 
lar builds mentioned. Finally, based on basic searching (using a tool such as grep) 
one might uncover any of the following files: Breakpoint.java,
MethodBreakpointTypeChange.java, BreakpointsLocation.java, and TaskRulerAction.java, among
hundreds of others. This example illustrates that the search space is large, even 
when a programmer uses the defect report's specific information. 

In the actual patch for this defect, developers edited only two source files. 
ToggleBreakpointAction.java contained the majority of changes that addressed 
this defect report, with one minor change to a call-site in
RulerToggleBreakpointActionDelegate.java. Some of the methods in those files include:

ToggleBreakpointAction(..., IVerticalRulerInfo rulerInfo)

ToggleBreakpointAction.reportException(Exception e)

RulerToggleBreakpointActionDelegate.createAction(ITextEditor editor, IVerticalRulerInfo rulerInfo)

The identifier names associated with these two files show clear language over- 
lap with the report above. For example, even when only the report title and the 
method names are considered, key words such as breakpoint, exception and ruler 
occur in both sets. When examining the overall word similarity, the two files 
that were changed for the fix are among those files most similar to the text in 
the defect report. Using textual similarity not only avoids unrelated methods 
considered by traditional search techniques [20], but further limits the fault lo- 
calization search space by trimming files with coincidental or narrow language 
overlap. Aggregating overall word similarity ensures that only documents with 
considerable and meaningful similarity are favored. 

The log messages associated with software repositories represent another pos- 
sible source of human-chosen natural language information to leverage when lo- 
calizing faults. The ToggleBreakpointAction.java file accumulated seventeen log 
messages over four years' worth of changes. Examples of these log messages
include:

— "Can't set a breakpoint on the first line of an editor" 

— "Allow multiple debuggers to create breakpoints using the 
same editor."

— "NullPointerException when trying to set 
breakpoint (in ToggleBreakpointAction)"

Terms such as breakpoint, editor and exception occur in both the log messages 
and the defect report, suggesting that this file may be relevant. Repository log 
messages are typically written in plain natural language which can be extracted 
with minimal analysis effort, much like comments in source code. Furthermore, 
repository log messages are written to chronicle the changes made to a given 
source code file. In addition, as many of these changes attempt to address pre- 
vious defects, it is reasonable to assume they might contain specific terms taken 
from previous defect reports themselves. 

We hypothesize that prioritizing the search space by ranking files of interest in 
this manner can greatly facilitate fault localization — using only static, natural 
language information, such as the defect report, source code and log messages. 
In the next section, we present a model to take advantage of this intuition. 

3 Methodology 

Our goal is to reduce the cost of software maintenance, focusing on fault or defect 
localization. The available input includes a defect report describing a fault, as 
well as static textual software development artifacts, such as the project source 
code and revision history. The desired output is an ordered list of source files 
that are likely to contain the cause of that fault. 



To fix the defect in question, such a list can be explored directly or further
refined, depending on the size of the system and the resources available. While the
resulting lists can still be quite large and must be processed manually, previous
work has shown that such filtering localizations are helpful [6]. More specifically,
over a similar set of defect reports accompanied by lists of methods (i.e., back- 
traces or counterexamples), humans were shown to take less time to address 
those defect reports in which an additional tool-generated annotation narrowed 
that information down to a smaller number of lines [43, Fig. 5]. Fault localization 
information might also be used as an attachment on the original defect report: 
in a study of over 27,000 historical defect reports, those including similar attach- 
ments and comments (e.g., backtraces or lists of methods) were more likely to 
be resolved rapidly [21, Fig. 7]. 

Since defect reports and development text have different formats, we pro- 
pose to map both of them to structured document intermediate representations. 
These intermediate representations reflect, but simplify, the structure of the orig- 
inal documents. We then build a model based on pairwise relationships between 
subparts of each document, and rank each source file accordingly. In Section 3.1 
we formalize a general document representation and then explain how we com- 
pare various sub-representations in Section 3.2. Finally, we formalize the overall 
technique in Section 3.3. 

3.1 Structured Document Representation 

Both defect reports and source files are represented as distinct structured doc- 
uments. In this paper, we use structured document to denote a set of (name, 
value) pairs (called "features"), where values are well-typed and drawn from
non-overlapping parts of the original artifact. The types considered include num- 
ber, string, list of strings, and term frequency vector. A term frequency vector 
is a mapping from terms (i.e., words) to the frequencies with which they appear 
in a given text. Term frequency vectors are often used in natural language pro- 
cessing; we use them to represent unbounded freeform text such as defect report 
descriptions or source code comments. 

We map defect reports to our intermediate form directly. Defect reports are 
structured natural language files containing multiple parts, such as title, de- 
scription, optional stack trace, project versions affected, and operating systems 
affected [21]. We first focus on the natural language title and description. We 
break the text into a list of terms by splitting on whitespace and punctuation
and converting each term to all lowercase characters; we then construct term
frequency vectors from the resulting multiset of words. We also parse
and record categorical data, such as the operating system and software
version, representing them as discrete values in the structured document (e.g., as
strings and numbers). Finally, we parse any stack traces into ordered sequences
of strings.
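
As an illustration only, the title and description processing described above can be sketched in a few lines of Python. This is not the implementation evaluated in this paper: the function name `report_term_vector` is hypothetical, and we assume that every character other than an ASCII letter or digit counts as whitespace or punctuation.

```python
import re
from collections import Counter

def report_term_vector(text):
    """Build a term frequency vector from free-form defect report text.

    Assumption: a term is a maximal run of ASCII letters or digits, so
    everything else acts as a separator. Terms are lowercased.
    """
    terms = [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]
    return Counter(terms)

# Title of the Eclipse defect report discussed in Section 2.
title = "Exception when placing a breakpoint (double click on ruler)"
vector = report_term_vector(title)
```

Here `vector` maps each of the nine title words to a count of one; punctuation such as the parentheses never becomes a term.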

Source code, which is not expressly written in natural language, is handled 
similarly, but with a few extensions that have been shown to be effective in 
previous work involving textual analysis [39, 45]. We obtain an initial list of terms
by splitting on whitespace and punctuation. However, we obtain further refined
terms by taking advantage of paradigms such as Hungarian notation, camel case
capitalization, and the use of underscores to separate terms in a single string [39].
For example, given the string "nextAvailableToken" we increment frequencies for 
the following terms: "next", "available", "token", and "nextAvailableToken". 
Source files are also structured and can also be decomposed into substructures. 
Substructures include method signatures, method bodies, comments, and string 
literals, among others. In addition to the overall term frequency vector for the 
entire file, each substructure is processed separately into its own term frequency 
vector. Thus our intermediate representation for a source file will include a term 
frequency vector for words in comments, one for words in method bodies, and 
so on. 
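
The identifier refinement above can be sketched as follows. This is a minimal sketch under stated assumptions rather than our actual tooling: the names `code_terms` and `code_term_vector` are hypothetical, sub-terms are lowercased while the whole identifier keeps its original spelling (as in the "nextAvailableToken" example), and the whole identifier is only added when it actually splits into more than one sub-term.

```python
import re
from collections import Counter

# Sub-term boundaries: underscores, lower-to-upper case changes, and the
# end of an acronym that is followed by a capitalized word.
_SUBTERM = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+")

def code_terms(token):
    """Return the terms contributed by one source token."""
    parts = [p.lower() for p in _SUBTERM.findall(token)]
    if len(parts) > 1:
        # Assumption: compound identifiers also count once as a whole.
        return parts + [token]
    return parts

def code_term_vector(text):
    """Build a term frequency vector for a fragment of source text."""
    vector = Counter()
    for token in re.findall(r"[A-Za-z0-9_]+", text):
        vector.update(code_terms(token))
    return vector

example = code_terms("nextAvailableToken")
vector = code_term_vector("int nextAvailableToken = nextToken();")
```

The same function could be applied separately to each substructure (comments, method bodies, and so on) to obtain one vector per substructure.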

Finally, many mature software projects also maintain revision information. 
We utilize two specific forms of version control information: human-written 
change log messages and frequency of revision ("code churn"). When a devel- 
oper makes changes to one or more files, the common practice is to include an 
informative message when updating a central repository. These messages often 
explain both "what" the change does (e.g., "add bounds checks when receiving 
socket data") and "why" it was made (e.g., "fix high-priority buffer overrun in 
networking code") [11,46]. These messages thus relate developer concepts and 
vocabulary to particular files, and can thus be used to aid fault localization. 
The logs are parsed as basic natural language and included as a substructure 
of all relevant source files. Additionally, we record the dates at which a file is 
changed over the lifetime of the project as a source code feature to compare with 
the submission date of the defect report. This feature may be helpful in fault 
localization as previous work has shown that historical "code churn" is often a 
good predictor of which files will be changed in the future [34].

Once we have reduced the source code and the defect report to structured 
documents, we can compare their substructures pairwise to determine their sim- 
ilarity. Figure 1 shows an example of this overall approach, with only some of 
the pairwise comparisons highlighted. Next, Section 3.2 describes our approach 
to comparing term frequency vectors, and Section 3.3 shows our overall model 
for fault localization. 

3.2 Textual Document Similarity 

Intuitively, two documents are similar when their subparts have a large fraction 
of their terms in common. The more terms the two corresponding pieces of text 
share, we assume, the more related concepts they both describe. 

In practice, some terms are more indicative of underlying similarity than
others. For example, terms such as "int", "class" or "the" may occur frequently
in two unrelated documents. We wish to limit the impact of such terms on our
similarity metric. However, since we desire a language-independent approach,
rather than hand-crafting an a priori stop-list of common words to discount, we
derive that information from the set of available defect reports and source code.
Intuitively, two documents that share a rarer term, such as "VerticalRuler",
should be measured as more similar than two documents that share a common
term such as "int".

Fig. 1. Architecture for fault localization via natural language. Defect reports and
source code are mapped down to structured documents. The substructures can then
be compared pairwise (not all shown) using separate metrics (e.g., the report title text
might be compared to the source file method names using a term frequency vector
comparison, while the report stack trace might be compared to the source file method
names using a positional index). The overall similarity, and thus the fault localization
rank of that source file, is the weighted sum of the substructure similarities.

To formalize this intuition we use the term frequency — inverse document 
frequency (TF-IDF) measure [24], which is common in information retrieval tasks. 
We want to measure how strongly any given term describes a document with 
respect to a set of context documents. Given a document d and a term t, the 
TF-IDF weighting is high if t occurs rarely in other documents, but relatively 
frequently in d. Conversely, a low weight corresponds to a term that is frequent 
globally and/or relatively infrequent in d. The weight for a document d and term 
t is computed as follows: 

\[ \mathrm{tf}(t, d) = \frac{\text{\# occurrences of } t \text{ in } d}{\text{size of } d} \]

\[ \mathrm{idf}(t) = \frac{\text{\# of documents}}{\text{\# of documents containing } t} \]

Note that idf is defined with respect to a corpus of available documents. In 
our experiments, when comparing against a given defect report, the corpus is 
taken to be all terms in all source files, log messages, and defect reports filed 
before that report in question. With the background formalisms thus described, 
we now explain how we combine them to aid fault localization. 
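
To make the two ratios concrete, the following sketch computes them over a tiny, invented corpus. It is an illustration under assumptions: the function names and the corpus contents are hypothetical, documents are given as lists of terms, and idf is taken as the plain ratio of document counts (many TF-IDF formulations additionally take a logarithm of that ratio).

```python
from collections import Counter

def tf(term, doc_terms):
    """Occurrences of term in the document, divided by the document size."""
    return Counter(doc_terms)[term] / len(doc_terms)

def idf_table(corpus):
    """Map each term to (# of documents) / (# of documents containing it)."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc_terms in corpus:
        doc_freq.update(set(doc_terms))  # count each document at most once
    return {term: n_docs / df for term, df in doc_freq.items()}

# Hypothetical corpus of four already-tokenized documents.
corpus = [
    ["int", "vertical", "ruler"],
    ["int", "class", "database"],
    ["int", "database"],
    ["verticalruler", "class"],
]
idf = idf_table(corpus)
```

In this toy corpus the rare term "verticalruler" receives a weight of 4.0, while "int", which appears in three of the four documents, receives only 4/3.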

In our approach, the overall similarity between a defect report and a source 
file is built up from the pairwise similarities between their substructures (e.g., 
term frequency vectors). We wish to empirically determine which pairwise com- 
parisons are the most predictive of fault localization when the two structured 
documents in question are compared. For instance, previous work has shown
that defect report titles are highly significant when searching for duplicate
reports [22]; we hypothesize that they may be similarly significant when attempting
to locate defects. Section 3.3 describes such a weighting.

3.3 Our Technique 

We build on a portion of the TF-IDF formalism to form our overall similarity 
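metric:

\[ \mathrm{similarity}(v_1, v_2) = \sum_{t \in v_1 \cap v_2} v_1[t] \times v_2[t] \times \mathrm{idf}(t) \]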

For each term contained in both documents' parts, we multiply the product of 
its frequencies in both documents' subparts by that term's idf weight. While we
use idf in a standard way, we measure term frequency based only on the number 
of occurrences of a given term without normalizing based on document size. The 
aggregate sum over all words' values then serves as the similarity measure for 
those two documents' subparts. 
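
The metric just described is small enough to state directly in code. The sketch below assumes term frequency vectors are dictionaries of raw counts and that `idf` is a precomputed dictionary; the name `similarity`, the example values, and the choice to give a term missing from the idf table a weight of zero are all assumptions made for illustration.

```python
def similarity(v1, v2, idf):
    """Sum, over shared terms, of count-in-v1 * count-in-v2 * idf(term).

    Counts are raw occurrences: no normalization by document size.
    """
    shared = v1.keys() & v2.keys()
    return sum(v1[t] * v2[t] * idf.get(t, 0.0) for t in shared)

# Hypothetical vectors and idf weights.
report_title = {"breakpoint": 2, "the": 3}
method_names = {"breakpoint": 4, "the": 1, "ruler": 1}
idf = {"breakpoint": 10.0, "the": 1.0, "ruler": 20.0}

value = similarity(report_title, method_names, idf)
```

With these numbers the shared rare term contributes 2 * 4 * 10 = 80, the shared common term contributes only 3 * 1 * 1 = 3, and "ruler" contributes nothing because it appears in one vector only.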

A major distinction between this metric and standard approaches is that we 
do not normalize for the size of the documents. While normalization is natural 
in many information retrieval tasks, we claim that the special structure of source 
code and the fault localization task make it undesirable here. For example, con- 
sider a defect report that mentions the term "VerticalRuler" in a project where 
the only source code reference to that term occurs inside one very large source 
file. In such a case, we would like to report that single source file as very similar 
to the defect report. However, if the file's size were normalized, it would appear 
to be less similar to the defect report than smaller files that share more common 
terms (e.g., "database"). Additionally, large projects often contain many code 
clones, and while not all cloning is harmful [25], much of it is inconsistent: for 
example, code clones are changed consistently a mere 45-55% of the time [28]. 
In standard information retrieval, near-duplicate text may be an uninteresting 
search result, but when looking for defects, near-duplicate code clones should all 
be considered. We want to account for the possibility of higher concentrations 
of code clones in larger files and not discount the associated natural language 
artifacts based on file size alone. In general, other works apply size-based normal- 
ization when large documents increase false positive rates or otherwise degrade 
the accuracy of a given method. We claim that the loss of precision associated 
with normalization outweighed the benefits it provided. This is in line with pre- 
vious claims [39] that traditional information retrieval search techniques used 
for documents do not map perfectly to code-based textual analysis. 

While the above technique is intended for use with two term frequency vec- 
tors, we require certain adaptations for other types of structured data. Categor- 
ical data, such as operating system flavors or program versions, are treated as a 
vector with a single term and the metric can be used in the standard fashion. 
Stack trace vectors — sequences of strings representing method names — are 
compared as word vectors by using the inverse positional index of a method in 
the call trace name as its frequency (thus weighting the first method the highest). 

Finally, we incorporate the idea of code "churn" into the technique as it
has been shown to correlate with defect density and is readily measurable given a
source repository [34]. Formally, we measure a source file's degree of "churn" 
by counting the number of times the file was changed during a set window of 
time. In our experiments, we used the entire available source history as the time 
window, equating code churn with the number of changes that had been checked 
in against a file. Similar to categorical data, we treat "churn" as a vector with a 
single term. 
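
One way to picture these adaptations is as small helper functions that turn each non-textual feature into a vector usable by the same metric. This is a sketch under assumptions, not our implementation: the helper names and the "churn" key are hypothetical, "inverse positional index" is read as 1/position with the first frame at position 1, and a method that appears more than once keeps its highest weight.

```python
def categorical_vector(value):
    """Treat a categorical value (e.g., an operating system) as one term."""
    return {value: 1}

def stack_trace_vector(frames):
    """Weight each method by the inverse of its 1-based trace position."""
    vector = {}
    for position, method in enumerate(frames, start=1):
        # Assumption: a repeated method keeps its earliest (largest) weight.
        vector.setdefault(method, 1.0 / position)
    return vector

def churn_vector(change_dates):
    """Represent churn as a single term counting the recorded changes."""
    return {"churn": len(change_dates)}

trace = stack_trace_vector(["getLineInformation", "toggleBreakpoint", "run"])
```

In `trace` the first method receives weight 1.0, the second 0.5, and the third one third, so methods nearer the top of the trace dominate the comparison.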

Given a defect report D and a set of source files f_1 . . . f_n, our goal is to produce
a rank-ordered list of the files, weighted such that files likely to contain the defect
are at the top. Human developers then inspect the files on the ranked list in order
until the fault has been localized. The rank of a file f is given as follows:

\[ \mathrm{rank}(f) = \sum_{j} \sum_{k} c_{jk} \times \mathrm{similarity}(v_j, v_k) \]

where v_j ranges over all of the term frequency vectors in the defect report's
intermediate representation, v_k ranges over all of the term frequency vectors in the
source file's intermediate representation, and each c_jk is a weighting constant for
that particular vector pair. The c_jk constants are the formal model: a high value
indicates that similarity in the associated pair of substructures (e.g., defect
report title paired with source code comments) is relevant to fault localization.
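
The weighted sum can be sketched as follows. Everything concrete here is hypothetical: the substructure names, the weights, the idf values, and the two example files are invented for illustration, and structured documents are assumed to be dictionaries from substructure name to term frequency vector.

```python
def similarity(v1, v2, idf):
    """Unnormalized similarity: shared-term count products weighted by idf."""
    return sum(v1[t] * v2[t] * idf.get(t, 0.0) for t in v1.keys() & v2.keys())

def rank_score(report, source_file, weights, idf):
    """Weighted sum of pairwise substructure similarities."""
    total = 0.0
    for (j, k), c_jk in weights.items():
        if j in report and k in source_file:
            total += c_jk * similarity(report[j], source_file[k], idf)
    return total

def rank_files(report, files, weights, idf):
    """Return file names ordered from most to least similar to the report."""
    return sorted(files,
                  key=lambda name: rank_score(report, files[name], weights, idf),
                  reverse=True)

# Hypothetical inputs.
report = {"title": {"breakpoint": 1, "ruler": 1}}
files = {
    "TaskRulerAction.java": {
        "method_names": {"ruler": 1, "task": 2}, "comments": {}},
    "ToggleBreakpointAction.java": {
        "method_names": {"breakpoint": 3, "ruler": 1},
        "comments": {"breakpoint": 2}},
}
weights = {("title", "method_names"): 2.0, ("title", "comments"): 0.5}
idf = {"breakpoint": 5.0, "ruler": 8.0, "task": 3.0}

ranking = rank_files(report, files, weights, idf)
```

With these invented numbers ToggleBreakpointAction.java scores 2.0 * 23 + 0.5 * 10 = 51 and TaskRulerAction.java scores 2.0 * 8 = 16, so the former is listed first.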

One approach would be to use machine learning or regression to determine 
the values for the c_jk weightings. The size of our dataset, which includes tens
of millions of datapoints and all terms in over 48,000 files and 6.5 million lines
of code, precludes such a direct approach, however. Attempts to apply linear
regression to the dataset failed to terminate on a 36 GB, 64-bit eight-core 3.6
GHz machine within four hours. For scalability, we instead use several common
statistics as a starting point for a parameter space optimization to obtain a 
model (see Section 4.2). 

4 Evaluation 

We conducted two main experiments to evaluate our approach. The first directly 
compares the accuracy of our technique to other lightweight baselines at file- 
level localization and indirectly compares to state-of-the-art techniques. The 
second experiment quantitatively verifies our hypothesis that fault localization 
via textual analysis depends significantly on human word choice. 

4.1 Subject applications and defects 

The experiments used three large, mature open source programs and 5345 total 
defect reports, shown in Figure 3. 

We chose these projects for several reasons. First, they are relatively indicative
of substantial, long-term real-world development in terms of size (at least 6.5
million lines of code total) and maturity (each is 8 to 11 years old). Additionally,
each project has both defect report and source code repositories.

Project (Date checked out)    Source files with defects    Total source files    Percentage
Eclipse (2009.09.08)                               2660                 21303        12.49%
Mozilla (2009.09.08)                               1811                  6179        29.31%
OpenOffice (2009.09.22)                            1463                  1507        97.08%
Total                                              5934                 28989        20.47%

Fig. 2. Distribution of files with defects throughout the subject applications. In this
case, a file is said to contain a defect if it was changed in a repository revision that
specifically mentioned fixing a certain defect, and addressing that defect involved only
changes to source files.

For each program, we obtained the subset of the available defect reports for 
which we could establish a definitive link between the report and a corresponding 
set of changes to source files. We thus restricted attention to those defect reports 
that were mentioned by number in source control log messages. We additionally 
restricted attention to reports of actual faults, omitting feature requests and 
other invalid or duplicate reports filed using the defect report system. Also, we 
only considered defects for which all corresponding changes took place in source 
files (i.e., .java, .cpp, etc., but not .xml) in the main branch of each project
(e.g., omitting changes to minor branches, testing branches, or data files). 2
Accordingly, the numbers for "files used" and "code used" in Figure 3 correspond
to source files in the trunk of each project's repository. Finally, we excluded files or
reports that could not be processed (e.g., from CVS or parsing errors). 

To avoid over-fitting the model to our chosen data set, we split the defects
into separate training and testing subsets. To train the model, we selected 450 
(8%) of the defects that occurred first chronologically across all three projects. 
The model was created and refined using only these preliminary defects and the 
remaining 4915 defects were held out to evaluate the model. This suggests that 
such an approach could be implemented by collecting a minimal number of initial 
defects while achieving high accuracy as reported in the following sections. 3 
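
A chronological holdout split of this kind can be sketched as below. The record layout, the field name `filed`, and the example identifiers and dates are hypothetical; dates are assumed to be ISO-formatted strings so that string order matches chronological order.

```python
def chronological_holdout(defects, n_train):
    """Train on the n_train earliest defects and hold out all later ones."""
    ordered = sorted(defects, key=lambda d: d["filed"])
    return ordered[:n_train], ordered[n_train:]

# Hypothetical defect records.
defects = [
    {"id": 3, "filed": "2005-04-14"},
    {"id": 1, "filed": "2003-01-09"},
    {"id": 2, "filed": "2004-07-21"},
    {"id": 4, "filed": "2006-11-02"},
]
train, held_out = chronological_holdout(defects, n_train=1)
```

Because the split follows filing order, no held-out defect precedes a training defect, which avoids training on reports from the future.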

4.2 Parameter space optimization 

Our first step is to build a model relating similarity comparisons between defect 
report and source code structures to fault localization. In the terminology of 
Section 3.3, this involves determining values for the 34 distinct c_jk weights.

2 Eclipse's /cvsroot/eclipse; Mozilla's /cvsroot; OpenOffice's /trunk.

3 We could have used cross-validation [27] instead to help detect bias from over-fitting,
but prefer to use holdout validation because of the large number of available
datapoints and the time-series nature of the data: defect reports often make reference
to previous defect reports [21, 22]. It would not be valid to train a model on future
defect reports and evaluate it on past ones.



Program       Total Defects Used    Files Used    Lines of Code Used    Language(s)    Avg. report length (lines)    Avg. report title (words)
Eclipse                    1,272        23,601             3,476,794    Java                              172.535                        8.642
Mozilla                    3,033        14,651             2,262,877    Java, C++                         316.811                        9.428
OpenOffice                 1,040         9,992               815,473    Java, C++                          60.547                        5.623
Total                      5,365        48,244             6,555,144

Fig. 3. Subject programs used in our evaluations. "Defects" counts reports that could 
be linked to a particular set of changes. "Files" counts retrieved source files in the 
project branch, including those not involved in defect reports. "Lines of Code" measures 
the size of those source files, while "Languages" lists their programming languages. The 
last two columns measure aspects of the defect reports used. 

To build such a model we first performed a one-way analysis of variance 
(ANOVA) on a subset of the data to estimate the predictive power of each 
possible document comparison. For each defect report we consider all of the files 
that were eventually fixed by the developers and also 150 files, chosen at random, 
that were not. 4 We pair each such file f_i with the original defect report D to
produce one datapoint. Each datapoint has multiple associated features (i.e., the
explanatory variables): there is one feature for each of the 34 (v_j, v_k) vector
pairs, with the measured similarity serving as the feature value. The response
variable for a given datapoint is set to 1 if the file was modified by developers
and 0 otherwise.
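
The sampling of files for one defect report might look like the sketch below. It shows only the choice of files and the 0/1 response; in the full procedure each pair would also carry the 34 similarity values as features. The function name, the file names, and the fixed random seed are assumptions made for illustration.

```python
import random

def build_datapoints(fixed_files, all_files, n_random=150, seed=0):
    """Pair one defect report's candidate files with a 0/1 response.

    Files changed by the fix get response 1; up to n_random other files,
    chosen at random, get response 0.
    """
    fixed = set(fixed_files)
    others = sorted(f for f in all_files if f not in fixed)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sampled = rng.sample(others, min(n_random, len(others)))
    return [(f, 1) for f in sorted(fixed)] + [(f, 0) for f in sampled]

# Hypothetical project with 1000 files, two of which were fixed.
all_files = ["File%d.java" % i for i in range(1000)]
points = build_datapoints(["File3.java", "File7.java"], all_files)
```

For this example the result holds 152 pairs: the two fixed files with response 1 and 150 randomly chosen unfixed files with response 0.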

An ANOVA measures the ratio of the variance explained by each model
feature (i.e., each (v_j, v_k) similarity) over the variance not explained. We use
this ratio only as a starting point for the weights: distant values will merely yield a
longer training search time. These ANOVA values may not be optimal because
our final model goal is to rank order the files for fault localization and not to
minimize the error between a model and the artificial 0 and 1 response variables.

The second step was to perform a principal component analysis (PCA) to
determine the number of components that were relevant to the task of detecting
the location of a fault in source code. Given our 34 possible document substructure
comparisons, this analysis showed that a combination of 15 accounted for
more than 99% of the overall variance in the data. The final c_jk values were
obtained via a gradient ascent parameter space optimization. In each iteration,
the best model available was compared to similar models, each constructed by
increasing or decreasing the value of a single c_jk by 10%. The comparison was
conducted using the score metric detailed in Section 4.3. We terminated the
process when the improvement between one iteration and the next was less than
0.01%; this took 5 iterations. We used the final c_jk values as our formal model.
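
The search loop described above can be sketched as a simple hill climb. This is an illustrative sketch, not the code used in our experiments: `optimize_weights` and `toy_score` are hypothetical names, the evaluation function is assumed to return a score as a fraction between 0 and 1, and the 0.01% stopping threshold is interpreted as an absolute gain of 0.0001 in that fraction.

```python
def optimize_weights(weights, evaluate, step=0.10, min_gain=0.0001):
    """Repeatedly move to the best neighbor that changes one weight by 10%."""
    best = dict(weights)
    best_score = evaluate(best)
    while True:
        candidates = []
        for key in best:
            for factor in (1.0 + step, 1.0 - step):
                trial = dict(best)
                trial[key] = best[key] * factor
                candidates.append((evaluate(trial), trial))
        if not candidates:
            break
        top_score, top = max(candidates, key=lambda c: c[0])
        if top_score - best_score < min_gain:
            break  # improvement too small: stop searching
        best, best_score = top, top_score
    return best, best_score

def toy_score(w):
    """Hypothetical stand-in for the score metric, peaking at w["a"] == 2."""
    return 1.0 - (w["a"] - 2.0) ** 2 / 10.0

best, best_score = optimize_weights({"a": 1.0}, toy_score)
```

On the toy objective the search climbs from 1.0 toward 2.0 in 10% steps and stops once neither neighbor improves the score by the threshold. In practice `evaluate` would compute the average score of Section 4.3 over the training defects.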



4 The inclusion of 150 files was chosen to be as large as possible while allowing the 
problem to be tractable on available hardware; see Section 3.3. 



Test Set            # Defects   Our Approach   Stack Trace   Code Churn   Optimal Search 
OpenOffice only          1018        82.728%       57.979%      72.755%          75.731% 
Eclipse only             1124        89.937%       56.295%      73.131%          91.155% 
Mozilla only             2773        95.359%       50.152%      93.860%          87.906% 
Stack traces only         325        89.608%       65.060%      76.442%          90.683% 
Complete set             4915        91.502%       53.137%      84.820%          86.128% 



Fig. 4. Score values for selected techniques. The "Test Set" column lists examined 
subsets of the 4915 defects from three programs; 450 separate defects were used as 
training in Section 4.2. "Our Approach" measures the score obtained by our tech- 
nique. The "Stack trace" baseline favors files mentioned in user-provided stack traces, 
the "Code churn" baseline favors frequently-changed files, and the "Optimal search" 
baseline simulates an optimal code search based on defect report terms. 

4.3 Experiment 1 — Ability to localize faults 

Our first experiment measures the accuracy of our technique when localizing 
faults. We compare two versions of our technique against three baseline approaches 
directly. We also indirectly compare against the published results of three 
state-of-the-art tools using a common metric. 

We adopt the score metric for measuring the accuracy of a fault localization 
technique. The score metric is commonly used in fault localization research [13, 
23, 37]. As described in previous work, "the score defines the percentage of the 
program that need not be examined to find a faulty statement in the 
program." [23, p. 6] For example, a ranking for an OpenOffice defect report that 
requires the user to inspect 2000 of the 9992 files before finding the right file 
has a score of (9992 - 2000)/9992 = 79.98%, or roughly 80%. Higher score values 
indicate better accuracy. We apply the score metric at the file level of 
granularity. We report the average score over all defects available. 
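
The file-level score computation can be sketched as below. The function names are hypothetical, and counting the files ranked ahead of the first faulty file as "inspected" is an assumption about how a ranking is converted into an inspection count.

```python
from typing import Iterable, Sequence, Set


def score(files_inspected: int, total_files: int) -> float:
    """Percentage of files that need not be examined."""
    if total_files <= 0 or not 0 <= files_inspected <= total_files:
        raise ValueError("need total_files > 0 and 0 <= files_inspected <= total_files")
    return 100.0 * (total_files - files_inspected) / total_files


def score_for_ranking(ranking: Sequence[str], faulty: Set[str]) -> float:
    """Score of a ranked file list, stopping at the first faulty file (assumed)."""
    for inspected, name in enumerate(ranking):
        if name in faulty:
            return score(inspected, len(ranking))
    raise ValueError("no faulty file appears in the ranking")


def average_score(scores: Iterable[float]) -> float:
    """Mean score over a set of defects."""
    values = list(scores)
    return sum(values) / len(values)


# The OpenOffice example from the text: 2000 of 9992 files inspected.
example = score(2000, 9992)
```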

Figure 4 shows the results. A lower baseline of 50% represents inspecting 
files in random order. Our approach outperforms all baselines over the entire 
test set (highlighted in boldface in Figure 4) and is generally better than other 
approaches in most subsets. The "Stack traces only" subset includes all defect 
reports that featured stack traces. Note that of the 4915 defect reports used 
to evaluate our approach, only 325 (about 6.6%) contained stack traces. 

We compare against three baselines motivated in Section 2 by mimicking 
some steps developers might take when attempting to fix a defect. In each case 
we produce a ranked list and compute a score metric to admit a direct comparison 
with our technique. The code churn baseline ranks files in descending order of 
number of changes throughout the entire history of the system up to the date of 
the defect report in question. The stack trace baseline ranks files by their position 
in any stack trace provided as part of the defect report; all files not mentioned in 
the stack trace are equally likely to be chosen after all files mentioned in the trace. 
Finally, the optimal search baseline approximates a developer using a search tool 
with some degree of domain knowledge. Given a search term, such a tool can 
return a list of all source files mentioning that term, ignoring case, ranked by 
number of occurrences. All files that do not contain the term are equally likely to 
be chosen after any files that do contain the term. The optimal search baseline 
considers every word in the defect report and uses the one that yields the best 
score result (i.e., the search term that indicates all of the relevant files and as few 
irrelevant files as possible). Note that the best search term cannot, in general, be 
known a priori by an automated technique: using only the best term is meant to 
represent human knowledge of the software system. While this baseline does not 
perfectly model human code search, it approximates the process for the purposes 
of comparison. 
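
The three baselines can be sketched with one shared scoring helper. This is an illustrative sketch under several assumptions: a single target file per defect, ties resolved by their expected position under uniform random ordering, substring matching for search terms, and earlier stack frames ranked first. All function names are hypothetical.

```python
import re
from typing import Dict, List


def expected_inspected(rank_value: Dict[str, float], target: str) -> float:
    """Expected files inspected before the target; ties are broken at random."""
    t = rank_value[target]
    higher = sum(1 for v in rank_value.values() if v > t)
    tied = sum(1 for v in rank_value.values() if v == t) - 1  # exclude target
    return higher + tied / 2.0


def score_pct(rank_value: Dict[str, float], target: str) -> float:
    n = len(rank_value)
    return 100.0 * (n - expected_inspected(rank_value, target)) / n


def stack_trace_values(files: List[str], trace: List[str]) -> Dict[str, float]:
    """Earlier frames rank higher; files absent from the trace tie at the bottom."""
    return {
        name: float(len(trace) - trace.index(name)) if name in trace else 0.0
        for name in files
    }


def term_counts(files: Dict[str, str], term: str) -> Dict[str, float]:
    """Case-insensitive occurrence count of one search term in each file."""
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return {name: float(len(pattern.findall(text))) for name, text in files.items()}


def optimal_search_score(files: Dict[str, str], report: str, target: str) -> float:
    """Best score achievable by searching for any single word of the report."""
    terms = set(re.findall(r"\w+", report.lower()))
    if not terms:
        return score_pct({name: 0.0 for name in files}, target)
    return max(score_pct(term_counts(files, t), target) for t in terms)


# Hypothetical data, for illustration only.
files = {
    "a.c": "parse parse buffer",
    "b.c": "render window",
    "c.c": "Parse token",
    "d.c": "network socket",
}
churn = {"a.c": 10.0, "b.c": 3.0, "c.c": 7.0}  # change counts before the report date
churn_score = score_pct(churn, "c.c")
search_score = optimal_search_score(files, "crash when parse fails in window", "b.c")
```

The code churn baseline needs no separate function here: the per-file change counts are passed directly to `score_pct`.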

Over the entire test set, we outperform the stack trace, code churn and 
optimal search baselines by 38%, 7%, and 5% respectively. While the perfor- 
mance gain over a stack trace baseline is immediate, the lower performance gain 
over code churn overall and over optimal search within the Eclipse and Mozilla 
projects requires more of an explanation. Eclipse has the most source files out of 
the three benchmarks. Additionally, as noted in Figure 2, the bugs are localized 
to only 12.49% of the files. Both baselines thus have the potential to eliminate 
much of the search space. However, the results for both baselines for OpenOffice 
show that this is not the general case. In addition, since both external baselines 
provide overall scores of over 84% and 86% respectively, only a 16-point and 
14-point score increase is possible in either case. In that regard, our 7-point and 
5-point increases constitute nearly 44% and 36% of the respective remaining 
room for improvement. Finally, on large projects, even small gains are signifi- 
cant: for example, the 7% score increase over the code churn baseline prevents 
an aggregate of 4,501 source files (or 611,812 lines of code) from being considered 
during the fault localization search over our entire test set. 
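
The headroom argument is simple arithmetic and can be checked against the unrounded "Complete set" values in Figure 4. The helper name `headroom` is hypothetical.

```python
from typing import Tuple


def headroom(ours: float, baseline: float) -> Tuple[float, float, float]:
    """Return (gain in points, room left above baseline, gain as % of that room)."""
    gain = ours - baseline
    room = 100.0 - baseline
    return gain, room, 100.0 * gain / room


# "Complete set" row of Figure 4.
OURS = 91.502
churn_gain, churn_room, churn_fraction = headroom(OURS, 84.820)
search_gain, search_room, search_fraction = headroom(OURS, 86.128)
```

With the unrounded table values, the fractions come out to roughly 44% and 39%; the 36% quoted above follows from the rounded 5-point and 14-point figures.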

Our technique performed most poorly on OpenOffice defects: if only Eclipse 
and Mozilla are considered, our performance is almost 94%. This is explained by a 
quirk of the OpenOffice project: their defect reports are about four times smaller 
(see Figure 3), thus reducing one of the primary sources of textual similarity (see 
Section 3.1). 

The results presented in Figure 4 show that our tool outperforms lightweight 
baselines. We also suggest that our technique may perform better than more 
heavyweight techniques. Several state-of-the-art fault localization techniques re- 
port accuracy values for their tools in terms of the distribution of subject faults 
over the scale of possible score measures. For comparison purposes we use a 
weighted average of each score interval to calculate an overall accuracy measure 
for each approach. The tools of Jones et al. [23], Cleve et al. [13], and Renieris 
et al. [37] achieved aggregate score measures of 77.797%, 63.415%, and 56% 
respectively. The largest of these projects evaluated on 132 defects over seven 
files containing at most 560 lines of code each. While these results are measured 
on different test sets and are therefore not directly comparable, we note that 
our technique obtains a score result 14 points higher than previous work and is 
evaluated on an order-of-magnitude more defects and files. 
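
Collapsing a published score distribution into a single number can be sketched as a weighted average. This is an illustration only: the function name is hypothetical, each interval is represented by its midpoint (an assumption), and the distribution shown is invented rather than taken from any of the cited tools.

```python
from typing import List, Tuple

Interval = Tuple[float, float]


def aggregate_score(distribution: List[Tuple[Interval, float]]) -> float:
    """Weighted average of interval midpoints.

    Each entry is ((low, high), weight), where weight is the share of faults
    whose score fell in that interval.
    """
    total = sum(weight for _, weight in distribution)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(((low + high) / 2.0) * weight for (low, high), weight in distribution) / total


# Invented distribution: half of the faults score in 90-100%, and so on.
made_up = [((90.0, 100.0), 0.50), ((80.0, 90.0), 0.25), ((0.0, 80.0), 0.25)]
overall = aggregate_score(made_up)
```

Wide intervals make the result sensitive to the representative point chosen, which is one reason such aggregates give context rather than a direct comparison.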



Document feature                       Correlation with score 
Average report length                  0.24 
Maximum report length                  0.22 
Defect lifespan                        0.20 
Rate of commenting in edited source    0.18 
Number of duplicate reports            0.17 
Report readability                     0.10 
Number of edited source files          0.08 
Reported priority                      0.07 



Fig. 5. Pearson correlation between surface features and our technique's score. 

Finally, our technique is lightweight in terms of execution time. Assuming 
code files are kept indexed as word vectors, our tool always runs in under 10 
seconds per defect report and generally takes less than 1 second. 

4.4 Experiment 2 — Human word choice 

Our second experiment tests our hypothesis that our score accuracy is mainly 
due to correctly extracting and comparing the natural language chosen by hu- 
mans in defect reports and source files. We first demonstrate that our technique's 
accuracy is not dominated by other features, such as length, defect priority, or 
defect lifespan. Secondly, we alter the natural language of the subject reports 
systematically, showing that performance degrades in a proportional manner. 
Finally, we evaluate the relative predictive power of our model's features. 

We hypothesize that human-chosen natural language in defect reports and 
source code is a critical factor in our fault localization approach. We first discount 
several other potentially-prominent features in terms of predictive power with 
respect to the score accuracy of our technique. The features examined cover 
both defect reports and source code: the Flesch-Kincaid readability level of the 
report in question [18], the assigned defect priority, the number of total reports 
for a defect when considering all duplicates, the maximum report length for a 
defect, the average report length for a defect, the overall lifespan of the defect 
from reported defect to reported patch, the number of source files edited as part 
of the patch, and the rate of commenting in the edited source code. 

We calculated the Pearson correlation of all 4915 total defects' score measures 
with these features. The correlations can be found in Figure 5. It is generally 
accepted that correlations below 0.3 are not statistically significant [19]. All 
observed correlations fell well within these bounds and therefore we conclude that 
these features do not significantly affect our model. However, of all correlations, 
report length and rate of commenting had some of the highest relative values. 
This supports our claim that natural language is key to our technique's success, 
since these features typically relate directly to the natural language present. 
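
For reference, the Pearson correlation between a surface feature and the per-defect score can be computed as sketched below. The function name and the sample values are hypothetical; only the formula itself is standard.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal length, at least 2 values each")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Invented values: report length (words) and score (%) for five defects.
report_lengths = [40.0, 120.0, 75.0, 300.0, 15.0]
scores = [88.0, 93.0, 79.0, 95.0, 90.0]
r = pearson(report_lengths, scores)
```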

Fig. 6. The effect of replacing human-chosen words with various random words on our 
technique's score over all 4915 defects in the test set. 

Next, we demonstrate that our model is greatly affected by the users' choice 
of language in defect reports and the developers' choice of language in source 
code. To evaluate this, we measure our score accuracy as more and more 
human-chosen words are replaced by random words. We used three different random 
techniques to replace human-chosen words: replacing terms with words from the 
same general set (e.g., the set of all report description words), replacing terms 
with words from a different set (in this case, an English dictionary), and finally, 
replacing terms with strings of the same length made up of randomly selected 
characters (i.e., random noise). In each case, we altered the natural language 
in increments until the entire frequency vector had been changed, using the 
unaltered reports as a baseline. The results of this experiment can be found in 
Figure 6. Each datapoint represents the degradation in score of our algorithm 
running on the entire 4915-defect testing data subset with some fraction of each 
defect report's text altered. 
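
The three replacement strategies can be sketched as follows. This is a minimal illustration with hypothetical names; it operates on a token list rather than a frequency vector, and the word pools shown are invented.

```python
import random
import string
from typing import Callable, List, Sequence

Replacement = Callable[[str, random.Random], str]


def from_pool(pool: Sequence[str]) -> Replacement:
    """Replace a word with a random word from a pool (same corpus or dictionary)."""
    return lambda word, rng: rng.choice(pool)


def noise(word: str, rng: random.Random) -> str:
    """Replace a word with random lowercase characters of the same length."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in word)


def replace_words(
    tokens: List[str], fraction: float, replacement: Replacement, rng: random.Random
) -> List[str]:
    """Return a copy of tokens with the given fraction of positions replaced."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    count = round(fraction * len(tokens))
    altered = list(tokens)
    for i in rng.sample(range(len(tokens)), count):
        altered[i] = replacement(tokens[i], rng)
    return altered


rng = random.Random(0)  # fixed seed so a run is repeatable
tokens = "the editor crashes when saving a large document to disk".split()
same_corpus = from_pool(["window", "menu", "crash", "file"])  # invented pool
half_noise = replace_words(tokens, 0.5, noise, rng)
all_pool = replace_words(tokens, 1.0, same_corpus, rng)
```

Sweeping `fraction` from 0 to 1 and re-scoring the altered reports at each step would yield one curve per replacement strategy, as in Figure 6.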

As the natural language in defect reports is changed, and thus the useful 
information in the report is reduced, the performance of our technique degrades. 
The reduction in score is not strictly proportional, as is expected from the pres- 
ence of common words and our use of the idf weighting: retaining even a few 
words that account for some of the relevant document similarity in a given com- 
parison degrades the performance of the tool only slightly. In addition, when 
replacing terms with words from a different corpus the performance initially in- 
creases very slightly and then decreases, following the other two replacement 
techniques. The sharp increase of degradation when all terms have been altered 
further reinforces the idea that our approach can perform accurately with even 
a small amount of natural language information and fails only when almost all 
information is changed. The general trend in Figure 6 is that performance of our 
approach degrades when natural language information is removed or altered. 
Thus, we posit that our approach is leveraging the human-chosen language and 
not additional features. 

Fig. 7. The score results from a "singleton" analysis in which we used only one feature 
and measured the score it achieves alone. 

Finally, having established that human word choice is critical, we evaluate 
which words and comparisons are the most important. Figure 7 shows the pre- 
dictive power of the features in terms of a "leave-one-in" or "singleton" analysis. 5 
To obtain these results we built many different versions of the model, each uti- 
lizing only one of the features. Thus, the score measure for each version suggests 
the relative utility of the given feature with respect to localizing faults. 
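
A singleton analysis of this kind can be sketched as a loop that enables one feature at a time. The names are hypothetical, and `evaluate` is an assumed callback that returns the average score of a model built from the given weights.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def singleton_analysis(
    features: Sequence[str], evaluate: Callable[[Dict[str, float]], float]
) -> List[Tuple[str, float]]:
    """Score each feature alone, returning (feature, score) pairs, best first."""
    results = {}
    for chosen in features:
        # Weight 1 for the chosen feature, 0 for every other feature.
        weights = {name: (1.0 if name == chosen else 0.0) for name in features}
        results[chosen] = evaluate(weights)
    return sorted(results.items(), key=lambda item: item[1], reverse=True)


# Invented per-feature utilities standing in for a real evaluation.
usefulness = {"title/comments": 0.9, "body/methods": 0.8, "architecture/strings": 0.55}


def toy_evaluate(weights: Dict[str, float]) -> float:
    return sum(weights[name] * usefulness[name] for name in weights)


ranking = singleton_analysis(list(usefulness), toy_evaluate)
```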

The report title and body, as well as the method bodies and comments, are 
involved in many of the most useful relationships in our fault localization model. 
With respect to defect reports, the titles and bodies contain the majority of 
the natural language information chosen by the reporter and, as such, are more 
helpful than extraneous categorical data and stack traces. Comparatively, we 
believe that code revision log messages are helpful because they often address 
specific defects or defect reports and thus might use similar language. Code 
churn also proves to be predictive of defect location, supporting claims made in 
previous work [34]. Intuitively, code comments are effective when paired against 
terms from defect reports because they are both written explicitly in natural 
language and often encapsulate code specifications in a manner complementary 
to the language inherent in the code's identifiers. Method bodies contain most of 
the text associated within code files and thus also serve as effective predictors. 
Finally, more obscure categorical information (e.g., processor architecture) and 
string literals found in code were less useful to the model. 



5 The size of the dataset precludes a full ANOVA; see Section 3.3. In addition, heavy 
feature overlap precludes the use of a "leave-one-out" analysis; see PCA in Sec- 
tion 4.2. 



4.5 Threats to validity 



Although our experiments are designed to demonstrate that our technique per- 
forms well over a large number of defects and files, our results may not generalize 
to industrial practice. First, our benchmark programs may not be indicative. The 
programs we chose are all large, mature, open-source projects. While they span 
three individual domains, they may not generalize to all potential domains. Our 
results may not apply to younger, smaller projects, but we claim that fault lo- 
calization becomes less interesting as the project shrinks (e.g., in the limit, fault 
localization is rarely a primary concern for a project with only a few small source 
files). In addition, all of our benchmarks involve GUI components, making them 
more likely to support our hypothesis that report-writers and developers will use 
similar textual terms. We view our evaluation on large datasets (e.g., ten times 
larger than previously-published evaluations [13,23,37]) as an advantage. 

Bird et al. note that sampling defect reports for the purpose of experimen- 
tation may lead to biased results [5,8]. As a result, our technique may only be 
good at localizing certain types of faults (i.e., those that open source developers 
deign to mention in version control logs). Lacking a project with a linked version 
control and defect repository, we cannot mitigate this threat beyond our claim 
that manual inspection of the reports found the faults to be a relatively even 
cross-section of each project's repository over the history of that project (see 
Section 4.1). 

Our code churn baseline may not be indicative because it relies on eight to 
ten years of version control information. For example, it may perform particu- 
larly well on the larger and older Mozilla project, correctly giving low rankings 
to the many files that have been stable for years. In practice, a development 
organization may not have such rich version history information, or such stable 
files may be manually excluded by developers. 

Not all files in a project will be associated with fixed defects when using our 
selection methodology from Section 4.1. Previous researchers have noted that 
in practice, defects are not uniformly distributed [34]. If our model somehow 
learned the underlying defect distribution rather than using information from 
the defect reports, it would not generalize. We guard against this threat both by 
construction (i.e., our learned features are all coefficients for similarity metrics 
between defect report substructures and software development artifacts, and not 
development artifacts alone) and also by our use of holdout validation. 

Finally, when comparing our results with those of established fault local- 
ization techniques using the score metric, we interpolated to convert previous 
distribution-style results to single score numbers and thus admit more illustra- 
tive comparisons. Previous publications have reported score value distributions 
over intervals from 0% to 100%. We estimated based on a weighted average 
of the medians of each interval. As a result, these indirect comparisons with 
previously-published results cannot be used to draw firm conclusions and serve 
instead to provide descriptive context. 



5 Related Work 



Research related to our work falls into two main categories: prior work in fault 
localization, and prior work in reverse engineering. 

5.1 Fault Localization 

Ashok et al. propose a similar natural language search technique in which users 
can match an incoming report to previous reports, programmers and source 
code [4]. By comparison, our technique is more lightweight and focuses only on 
searching the code and the defect report. 

Jones et al. developed Tarantula, a technique that performs fault localiza- 
tion based on the insight that statements executed often during failed test cases 
likely account for potential fault locations [23]. Similarly, Renieris and Reiss use 
a "nearest neighbor" technique in their Whither tool to identify faults based on 
exposing differences in faulty and non-faulty runs that take very similar 
execution paths [37]. These approaches are quite effective when a rich, indicative 
test suite is available and can be run as part of the fault localization process. 
They thus require the fault-inducing input but not any natural language defect 
report. By contrast, our approach is lightweight, does not require an indicative 
test suite or fault-inducing input, but does require a natural language defect 
report. Both approaches will yield comparable performance, and could even be 
used in tandem. 

Cleve and Zeller localize faults by finding differences between correct and 
failing program execution states, limiting the scope of their search to only vari- 
ables and values of interest to the fault in question [13]. Notably, they focus on 
those variables and values that are relevant to the failure and to those program 
execution points where transitions occur and those variables become causes of 
failure. Their approach is in a strong sense finer-grained than ours: while noth- 
ing prevents our technique from being applied at the level of methods instead 
of files, their technique can give very precise information such as "the transition 
to failure happened when x became 2." Our approach is lighter-weight and does 
not require that the program be run, but it does require defect reports. 

More recent work conducted by Wang et al. aims to refine the concept of 
fault localization based on test suite coverage metrics [42]. They closely examine 
contextual information to detect faults that are being executed but not identified. 

Liblit et al. use Cooperative Bug Isolation, a statistical approach to isolate 
multiple defects within a program given a deployed user base. By analyzing large 
amounts of collected execution data from real users, they can successfully 
differentiate between different causes of faults in failing software [32]. Their technique 
produces a ranked list of very specific fault localizations (e.g., "the fault occurs 
when i > arrayLen on line 57"). In general, their technique can produce more 
precise results than ours, but it requires a set of deployed users and works best 
on those defects experienced by many users. By contrast, we do not require that 
the program be runnable, much less deployed, and use only natural language 
defect report text. 



Jalbert et al. [22] and Runeson et al. [38] have successfully detected duplicate 
defect reports by utilizing natural language processing techniques. We share with 
these techniques a common natural language architecture (e.g., frequency vec- 
tors, TF-IDF, etc.). We differ from these approaches by adapting the overall idea 
of document similarity to work across document formats (i.e., both structured 
defect reports and also program source code) and by tackling fault localization. 

5.2 Reverse Engineering 

Latent Semantic Indexing (LSI) is an information retrieval technique for mea- 
suring document similarity [14]. Similar to our technique, it uses word frequency 
vectors to measure co-occurrence of relevant terms in documents. Marcus and 
Maletic used LSI to expose document-to-source-code traceability [33]. While 
their work mainly focuses on matching documents from the initial phases of 
the development process, the work presented in this paper attempts to match 
specifically defect reports created by both users and developers throughout the 
maintenance process. Additionally, traditional LSI treats documents as a single, 
unified term frequency vector whereas our technique breaks documents down 
into substructures based on the hypothesis that certain language is more helpful 
for localizing defects. 

Li et al. have examined the problem of extracting information from structured 
documents in addition to categorizing that information [30]. They focus on user 
queries in particular, which is similar to the defect reports we study in this work. 
They also note that tailoring analyses to specific corpora is particularly helpful, 
which we confirm with the use of inverse document frequency for weighting 
individual terms. 

Devanbu et al. and Würsch et al. both developed software system search tools 
that leverage the natural language in both source code and related software 
artifacts [16,44]. While our system has a similar back-end natural-language- 
based approach, our overall goal is automatic fault localization, not general code 
search or program comprehension. 

Breu et al. studied the effect of user interaction throughout the defect fixing 
process [10]. Much like the work presented in this paper, they found that ad- 
ditional information and user clarification generally only serves to aid in fixing 
defects. While their focus is on the interaction between users and developers 
throughout the maintenance process, our work aims to measure the quality of 
information contained in different parts of documents associated with defect 
fixing. 

Ko et al. have studied the overall process used by developers to find infor- 
mation and understand programs [26]. They employed a human study to gain 
insight into what kinds of information developers think is relevant to a given 
task and how they make decisions in this process. The goal of our research is 
complementary to this work in that we are trying to automatically discern which 
information is related to specific defects and, in a broader sense, aid in the 
maintenance process on the whole. 



Shepherd et al. focused on both proving that the natural language in source 
code is meaningful and also on attempting to extract those language artifacts in 
a meaningful and useful manner [39]. They studied natural language use in code 
for the purpose of developing a specialized code-search technique specifically 
focused on identifying distributed concepts throughout a system. Similarly, Lawrie 
et al. have examined the quality of source code identifiers in terms of code 
comprehension [29]. They show that insightful and carefully chosen natural language 
identifiers make for more understandable and maintainable code. We build upon 
such work by leveraging these facts in the domain of fault localization. 

Much work has been done to measure the quality of natural language choices 
made by developers [12,15,29,40]. Additionally, some of this work looks at re- 
structuring or refactoring natural language artifacts in an attempt to reverse 
engineer the original developers' intentions and aid program understanding. We 
claim that measuring the quality of natural language is orthogonal to the work 
we present in this paper. We are more concerned with the ability of the natural 
language in both defect reports and source code to localize faults, regardless 
of the language's quality. While higher quality information may allow our tool 
to compare documents more accurately, our tool currently achieves very high 
accuracy without accounting for the quality of the underlying natural language. 

6 Conclusion 

We present a lightweight, scalable technique for localizing faults based on docu- 
ment similarities. We hypothesize that human-chosen natural language present 
in both defect reports and source code can be compared to identify potential 
fault locations based on natural-language descriptions. Our technique is entirely 
static and is language independent. 

An empirical evaluation shows that our technique not only performs better 
than several baseline approaches, but is comparable to the state-of-the-art tech- 
niques without requiring significant overhead or a runnable program and a test 
suite. We also demonstrated that the word choice in natural language artifacts 
was truly the dominant factor in our approach. 

A large empirical evaluation of our program on 5345 historical defects from 
three real-world programs totaling 6.5 million lines of code showed that we can 
reduce the search space for finding a fault by over 91% on average. We believe 
that this approach has the potential to significantly decrease the cost of fault 
localization, and thus software maintenance overall. 